Abstract. This paper describes the participation of the EMSE team in
the clinical decision support track of TREC 2015 (Task A). Our team
submitted three automatic runs based only on the summary field. The
baseline run uses the summary field as a query and the query likelihood
retrieval model to match articles. Other runs explore different approaches
to expanding the summary field: RM3, LSI with pseudo-relevant documents,
the semantic resources of UMLS, and a hybrid approach called SMERA that
combines the LSI-based and UMLS-based approaches. Only three of our runs
were considered for the 2015 campaign: RM3, LSI and SMERA.
1 Introduction

A survey of the state of the art in query expansion reveals an overwhelming
number of approaches [1]. If the retrieval model is precise enough to place
relevant documents at high ranks, approaches based on pseudo-relevance
feedback perform quite well without user intervention. On the other hand,
most of these approaches depend on word-based statistics, which makes them
unable to explicitly introduce phrases or multi-word named entities (assuming
word-based indexing). This issue can be addressed by ontology-based
techniques. Using an external resource provides the system with semantic
information, which leads to valuable expansion terms that cannot necessarily
be obtained from feedback documents.

Our participation in the clinical decision support track aims to evaluate a
Semantic Mixed Expansion and Reformulation Approach (SMERA) in the medical
context. This approach to query expansion combines an ontology-based method
(UMLS) with a local method that applies LSI to pseudo-relevant feedback
documents. A brief description of our submitted runs is given in Section 2.
Our proposed approaches are explained in detail in Section 3 for the LSI
approach and in Section 4 for the hybrid approach SMERA.


2 Our runs


We submitted three runs to Task A of the clinical decision support track of
TREC 2015:

EMSE_SumRM3 : Query expansion using pseudo-relevance feedback with a
language model [2];

EMSE_LSI : Query expansion with pseudo-relevant feedback documents using
LSI (cf. Sect. 3);

SMERA : A mixed query expansion and reformulation approach that combines
LSI with an ontology-based query expansion approach (cf. Sect. 4).

Our query reformulation method considers the final query to be a linear
combination of the user's original query terms and the representations of the
expansion sets obtained in the expansion step. The relevance score can thus
be expressed by Equation 1:

$$p(Q|d) = \lambda \prod_{q \in Q} p(q|d) + (1 - \lambda) \prod_{i=1}^{k} b(r_i)^{w_i} \qquad (1)$$

where p(q|d) is the query likelihood probability for the original query term
q and a document d, λ is the interpolation weight between the original query
and the expansion sets, r_i is an expansion set that is associated with at
least one original query term, k is the number of expansion sets, b(r_i) is
the belief calculated for this expansion set according to Metzler's approach
[3], and w_i is the weight of the estimated belief of the representation r_i.
In this participation, expansion sets are considered to be equally important
to the query, so w_i was set to one for all i.
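
As an illustration of Equation 1, the following minimal Python sketch combines
the query likelihood of the original terms with the beliefs of the expansion
sets. The function name combined_score and all numeric values are
hypothetical. The sketch assumes that w_i acts as an exponent on b(r_i), which
makes no difference when w_i = 1 as in this participation, and it takes
p(q|d) and b(r_i) as values already computed by the retrieval engine.

```python
from math import prod


def combined_score(term_probs, set_beliefs, lam=0.5, weights=None):
    """Score one document: lam * prod p(q|d) + (1 - lam) * prod b(r_i) ** w_i.

    term_probs  -- p(q|d) for each original query term q (assumed precomputed)
    set_beliefs -- b(r_i) for each expansion set r_i (assumed precomputed)
    weights     -- w_i for each expansion set; defaults to 1 for all i
    """
    if weights is None:
        weights = [1.0] * len(set_beliefs)
    if len(weights) != len(set_beliefs):
        raise ValueError("one weight per expansion set is required")
    original_part = prod(term_probs)
    expansion_part = prod(b ** w for b, w in zip(set_beliefs, weights))
    return lam * original_part + (1.0 - lam) * expansion_part


# Hypothetical values for a two-term query with two expansion sets.
score = combined_score(term_probs=[0.2, 0.5], set_beliefs=[0.4, 0.25], lam=0.5)
print(score)
```

With lam = 1 the score reduces to the plain query likelihood of the original
query, and with lam = 0 only the expansion sets contribute.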


3 EMSE-LSI approach


Several approaches exist to extract concepts from a set of documents (such as
LDA, ESA or LSI). In this study we chose to apply LSI to pseudo-relevant
feedback documents. It has been argued that LSI can detect higher-order
co-occurrence relationships between terms. This means that two terms that do
not occur together in the studied set of documents but frequently co-occur
with a third term will be considered semantically related by LSI. The idea is
to perform a singular value decomposition of a matrix A of m rows
(corresponding to m terms) and n columns (corresponding to n feedback
documents) that contains the frequencies tf of the terms in the feedback
documents. The result of this step is the three matrices presented in
Equation 2:


$$A_{m,n} = U_{m,m} \, S_{m,n} \, V^{T}_{n,n} \qquad (2)$$


where S is the diagonal matrix that contains the singular values of A. The
theory of LSI is that reducing the dimension of the three resulting matrices
to k gives an approximation of the original matrix A while reducing the noise
(Equation 3).


$$A_{m,n} \approx U_{m,k} \, S_{k,k} \, V^{T}_{k,n} \qquad (3)$$
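
The decomposition and its rank-k truncation can be sketched as follows. This
is a minimal example that assumes NumPy is available; the term list, the
frequency matrix and the helper name truncated_svd are hypothetical and serve
only to make the dimensions of Equations 2 and 3 concrete.

```python
import numpy as np

# Hypothetical toy data: rows are terms, columns are feedback documents,
# and each cell holds a term frequency tf.
terms = ["fever", "cough", "pneumonia", "fracture"]
A = np.array([
    [2, 1, 0, 0],
    [1, 2, 1, 0],
    [1, 1, 2, 0],
    [0, 0, 0, 3],
], dtype=float)


def truncated_svd(matrix, k):
    """Return the rank-k factors U_k (m x k), S_k (k x k) and Vt_k (k x n)."""
    # full_matrices=False yields the compact SVD, which is sufficient here
    # because only the first k singular triplets are kept.
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :k], np.diag(s[:k]), Vt[:k, :]


U_k, S_k, Vt_k = truncated_svd(A, k=2)
A_k = U_k @ S_k @ Vt_k  # rank-2 approximation of A, as in Equation 3

print(U_k.shape, S_k.shape, Vt_k.shape)  # (4, 2) (2, 2) (2, 4)
print(np.round(A_k, 2))
```

Each row of U_k is the k-dimensional vector of one term, so the rows follow
the order of the terms list.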


For our case of query expansion, we are interested in the matrix U_{m,k}.
This matrix contains the m vectors of the terms appearing in the
pseudo-relevance feedback documents. These vectors belong to the
k-dimensional semantic space created by LSI (cf. Fig. 1). To find the
expansion set of a query term q, we measure its similarity to the m terms
that appear in the feedback documents based on the Euclidean distance. We
then suppose that the terms that are the most similar to q belong to the same
implicit concept, as presented in Fig. 1. In some cases, two original query
terms are strongly related to each other. These two terms will have the same
statistics in the feedback documents and obtain identical vectors in the
semantic space generated by LSI (cf. Fig. 2). In this case, we consider that
these original query terms belong to the same implicit concept (c2 in Fig. 2)
and both correspond to a single expansion set in the reformulated query.

Fig. 1. Terms of feedback documents in the semantic space of LSI (example for
the case of 2 dimensions, k1 and k2). Filled circles: terms that appear in
the feedback documents; open circles: terms that appear in the feedback
documents and in the query.
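
The selection of an expansion set by Euclidean distance, and the merging of
query terms that share a vector, can be sketched in a few lines of Python.
The term vectors below are hypothetical two-dimensional values, the function
names are illustrative, and the tolerance tol used to decide that two vectors
are identical is an assumption of this sketch.

```python
from math import dist


def expansion_set(query_term, term_vectors, size=3):
    """Return the `size` terms closest to `query_term` in Euclidean distance."""
    q_vec = term_vectors[query_term]
    candidates = [t for t in term_vectors if t != query_term]
    # Ties are broken alphabetically so that the result is deterministic.
    candidates.sort(key=lambda t: (dist(q_vec, term_vectors[t]), t))
    return candidates[:size]


def merge_query_terms(query_terms, term_vectors, tol=1e-9):
    """Group query terms whose vectors coincide (within `tol`)."""
    groups = []
    for term in query_terms:
        for group in groups:
            if dist(term_vectors[term], term_vectors[group[0]]) <= tol:
                group.append(term)  # same implicit concept, one expansion set
                break
        else:
            groups.append([term])
    return groups


# Hypothetical term vectors in a 2-dimensional LSI space.
term_vectors = {
    "fever": (0.9, 0.1),
    "cough": (0.8, 0.2),
    "pneumonia": (0.85, 0.15),
    "chills": (0.7, 0.1),
    "fracture": (0.1, 0.9),
}
print(expansion_set("fever", term_vectors))  # ['pneumonia', 'cough', 'chills']
print(merge_query_terms(["fever", "fracture"], term_vectors))
```

Whether other query terms should be excluded from the candidate list is not
specified here, so the sketch leaves them in.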

In our TREC 2015 run we used the query likelihood language model [4] to
retrieve pseudo-relevant documents. Twenty documents were used to construct
matrix A. For LSI, 10 dimensions were considered. λ was set to 0.5 (cf.
Equation 1), and sets of three expansion terms were built.

4 EMSE-SMERA approach

Fig. 2. The merging of expansion sets in the case of query terms that are
semantically close in the LSI semantic space. Filled circles: terms that
appear in the feedback documents; open circles: terms that appear in the
feedback documents and in the query.

SMERA is a mixed approach that combines the LSI method with pseudo-relevance
feedback documents and a semantic method based on UMLS concepts. The
LSI-based method was used only to expand summary terms that cannot be matched
to UMLS concepts. Medical terms are disambiguated using MetaMap, which maps
them to unique concepts in the UMLS semantic resources. Concept names and
"preferred names" are then used as expansion terms and added to the
reformulated query. Temporal concepts were not explicitly eliminated with
this approach. The parameters of this run were set as follows: for the LSI
part of the approach we used 20 documents, 5 dimensions and 3 expansion
terms; for the UMLS part we used the matched concept name (retrieved by
MetaMap) and the preferred concept name as expansion terms. λ was also set
to 0.5.
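
The routing rule that SMERA applies to each summary term can be sketched as
follows. This is only an illustration under stated assumptions: the
dictionary umls_lookup stands in for the output of MetaMap (no MetaMap call
is made), the callable lsi_expand stands in for the LSI method of Section 3,
and all terms and concept names in the example are illustrative values, not
actual system output.

```python
def smera_expand(summary_terms, umls_lookup, lsi_expand):
    """Map each summary term to its list of expansion terms.

    umls_lookup -- dict: term -> (matched concept name, preferred name);
                   a hypothetical stand-in for MetaMap results
    lsi_expand  -- callable: term -> list of LSI expansion terms
    """
    expansions = {}
    for term in summary_terms:
        concept = umls_lookup.get(term)
        if concept is not None:
            # UMLS branch: use both names, dropping a duplicate if they match.
            expansions[term] = list(dict.fromkeys(concept))
        else:
            # Fallback branch: the term has no UMLS concept, so use LSI.
            expansions[term] = lsi_expand(term)
    return expansions


# Illustrative inputs only.
umls_lookup = {"heart attack": ("Heart attack", "Myocardial Infarction")}
lsi_terms = {"dizzy": ["vertigo", "nausea", "faint"]}

result = smera_expand(
    ["heart attack", "dizzy"],
    umls_lookup,
    lambda term: lsi_terms.get(term, []),
)
print(result)
```

Each resulting list plays the role of one expansion set r_i in Equation 1.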